Data conversion from GenomeStudio to PLINK format was carried out prior to data delivery.
Module 1 prepared data for QC procedures. In the data exported from GenomeStudio, samples were coded using project-specific retrieval IDs and non-informative family IDs. Also, no sex information was available.
Information about declared sample sex and pedigree structure provided by MoBa was used to update the dataset.
Markers with poor cluster separation, low 10% GC score and high AA theta deviation were removed. Clustering metrics were provided by the SNP table exported from GenomeStudio.
The array contains some duplicated/triplicated markers. Duplicates/triplicates were removed to avoid potential problems downstream.
Illumina provides a conversion list for converting marker IDs used on the array to rsIDs. In the provided list some markers are named ‘.’. To avoid duplicate/non-informative IDs, the original chip ID was used in situations where no rsID was available.
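The fallback rule described above can be illustrated with a minimal sketch. The function name, the dictionary layout and the example IDs are hypothetical and are not taken from the pipeline; only the rule itself (keep the chip ID when the conversion list gives ‘.’ or no entry) comes from the text.

```python
def resolve_marker_ids(chip_ids, chip_to_rsid):
    """Return one output ID per chip ID, preferring the rsID when one exists."""
    resolved = []
    for chip_id in chip_ids:
        rsid = chip_to_rsid.get(chip_id, ".")
        # '.' (or a missing entry) means no rsID is available: keep the chip ID.
        resolved.append(chip_id if rsid in (".", "") else rsid)
    return resolved


# Hypothetical conversion entries, for illustration only.
example_map = {"chipA": "rs123", "chipB": ".", "chipC": "rs456"}
example_ids = resolve_marker_ids(["chipA", "chipB", "chipC", "chipD"], example_map)
```

Because every marker without an rsID keeps its own chip ID, no two markers end up sharing the placeholder ‘.’.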
Illumina technical markers (assigned to chromosome 0) were excluded and markers with poor clustering were eliminated using metrics available from the variant table exported from GenomeStudio. Problematic markers reported by other consortia were subsequently excluded.
NOTE: All markers in the PAR region were correctly assigned to chr 24; no update was necessary.
NOTE: The lab in Rotterdam removed 7710 markers showing poor performance prior to delivery (these markers are not included in the dataset).
Updating sample IDs:
Samples in: 17949
Samples updated: 17949
Samples not updated: 0
Updating parental information:
Samples in: 17949
Samples assigned one or more parents: 5984
Samples not updated: 11965
Updating sex:
Samples in: 17949
Samples where sex was updated: 17949
Samples not updated for sex: 0
Remove markers by cluster separation < 0.4:
Markers in: 692367
Markers removed: 18154
Markers remaining: 674213
Remove markers by 10% GC score:
Markers in: 674213
Markers removed: 19915
Markers remaining: 654298
Remove markers by AA theta dev:
Markers in: 654298
Markers removed: 4123
Markers remaining: 650175
Remove duplicated markers:
Markers in: 650175
Markers removed: 480
Markers remaining: 649695
Update SNPs to rsID:
Markers in: 649695
Markers updated: 338017
Markers not updated: 311678
Removed technical markers (chr0):
Markers in: 649695
Markers removed: 1910
Markers remaining: 647785
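The marker counts reported for Module 1 can be checked for internal consistency. The sketch below is not part of the pipeline; the helper name and tuple layout are assumptions. It verifies that each step satisfies "in minus removed equals remaining" and that each step starts from the previous step's remaining count. The rsID update step is left out because it removes no markers.

```python
# (step, markers in, markers removed, markers remaining), copied from the log above.
MODULE1_MARKER_STEPS = [
    ("cluster separation < 0.4", 692367, 18154, 674213),
    ("10% GC score", 674213, 19915, 654298),
    ("AA theta dev", 654298, 4123, 650175),
    ("duplicated markers", 650175, 480, 649695),
    ("technical markers (chr0)", 649695, 1910, 647785),
]


def check_ledger(steps):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    previous_remaining = None
    for name, n_in, n_removed, n_remaining in steps:
        if n_in - n_removed != n_remaining:
            problems.append(f"{name}: {n_in} - {n_removed} != {n_remaining}")
        if previous_remaining is not None and n_in != previous_remaining:
            problems.append(f"{name}: starts at {n_in}, expected {previous_remaining}")
        previous_remaining = n_remaining
    return problems


ledger_problems = check_ledger(MODULE1_MARKER_STEPS)
```

The same helper can be pointed at the logs of the later modules to catch mislabelled or mistyped counts.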
Module 2 carries out temporary cleaning before checking sex of all samples in the dataset. IBD estimation is performed in order to correct any pedigree inconsistencies. A corrected .fam file containing corrected pedigree structure is used for the remainder of the QC. In an attempt not to lose any samples of good quality, all samples (with genotype missingness < 5%) are sent to phasing and imputation. However, samples that could not be reliably identified in the pedigree are flagged in the flaglist.
To identify an ethnically homogeneous set of samples before module 3, PCA is performed on the HapMap samples and the MoBa samples are projected onto the resulting plot; an ethnically homogeneous dataset is then selected manually for use in marker cleaning. Note that using projections will include some ethnic HapMap samples in the selection plots compared to running PCA on the complete merged dataset of MoBa samples with HapMap.
The dataset is then split into founders and offspring and samples with excess relatedness are removed.
The input file A with pre-solved problematic pedigrees contained 7 resolved families with 20 index individuals. The input file B with pre-identified problematic samples (often accidental duplicates of other samples) contained 65 individual IDs.
### Upstream (dynamic) input files

Genetic files with 17742 individuals reached this module. The original .fam file listed 5815 fathers (V3 column) and 5832 mothers (V4 column) who had genotypes (i.e., were listed in V2 column); also 66 fathers and 58 mothers without genotypes. The final .fam file will have them reset as missing (respective numbers 0 and 0).
The thresholds used to identify relationships were: parent-offspring, \(Z1 > 0.8\); twin or duplicated samples, \(PI\_HAT \ge 0.8\); full siblings, \(0.35 \le Z1 \le 0.65\) and \(0.35 \le PI\_HAT \le 0.65\). The Y chromosome genotype count threshold used to separate males from females was \(YC > 92\). The X chromosome \(F\) threshold used to separate females from males was \(F < 0.648\). Genetic sex was inferred based on both criteria. When the criteria disagreed (4 cases), samples were flagged as not suitable for analyses (phenotypeOK=FALSE) and genetic sex was inferred from the X chromosome data.
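As an illustration of how these thresholds combine, the following sketch classifies a pair from its \(Z1\) and \(PI\_HAT\) values and infers sex from the Y genotype count and the X chromosome \(F\). The function names, return labels and the order in which the relationship rules are tested are assumptions; only the numeric thresholds and the "use X when the criteria disagree" rule come from the text.

```python
def classify_pair(z1, pi_hat):
    """Label a sample pair using the thresholds stated in the report."""
    if pi_hat >= 0.8:
        return "duplicate_or_twin"
    if z1 > 0.8:
        return "parent_offspring"
    if 0.35 <= z1 <= 0.65 and 0.35 <= pi_hat <= 0.65:
        return "full_siblings"
    return "other"


def infer_sex(y_count, x_f):
    """Return (genetic sex, criteria_agree).

    y_count: number of called Y chromosome genotypes (male if > 92).
    x_f:     X chromosome F estimate (female if < 0.648).
    On disagreement the X-based call is returned, and the caller is
    expected to flag the sample (phenotypeOK=FALSE in the report).
    """
    sex_from_y = "male" if y_count > 92 else "female"
    sex_from_x = "female" if x_f < 0.648 else "male"
    return sex_from_x, sex_from_y == sex_from_x
```

For example, `infer_sex(500, 0.10)` gives the X-based call `"female"` together with `False`, which marks the kind of discordant sample that the report flags.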
Being found in input file A, 20 samples in the .fam file were assigned their true family IDs, genetic parents and genetic sex. These samples are suitable for analyses and are not flagged as problematic. Being found in input file B, 64 samples in the .fam file were assigned dummy family IDs (e.g. “prblm001”) and founder status (i.e., parental IDs were set to “0”), and their declared sex was set to their genetic sex. They were flagged as not suitable for future analyses (phenotypeOK=FALSE). In sex inference for all these updated samples, there were 0 cases where Y-chromosome and X-chromosome data did not agree (likely Klinefelter). The declared and inferred sex did not match in 35 samples.
The remaining .fam file contained 11549 declared parent-offspring relationships (5756 paternal, 5793 maternal). Genetic inference on the same data detected 11549 parent-offspring relationships and 0 pairs of duplicated (twin) samples. If there is a difference between declared and inferred numbers, the auto-generated .pdf report should be manually inspected to detect new sample-identity problems. In sex inference for all these samples, there were 4 cases where Y-chromosome and X-chromosome data did not agree (likely Klinefelter), and 4 cases where declared and inferred sex did not agree. These samples were flagged as not suitable for analyses (phenotypeOK=FALSE).
In total, the number of samples flagged as not suitable for analyses (phenotypeOK=FALSE) is 68. We do not trust the identity of these samples. The remaining 17674 samples were flagged as OK (phenotypeOK=TRUE).
The updated .fam file contained 11562 declared parent-offspring relationships (5762 paternal, 5800 maternal). The genetic (Xchr) sex was assigned to all the samples.
Markers and samples in
Markers in start: 647785
Samples in start: 17949
Temporary removal of markers with MAF < 10%
Markers in: 647785
Markers removed: 415378
Markers remaining: 232407
Permanent removal of samples with missingness > 5%
Samples in: 17949
Samples removed: 207
Samples remaining: 17742
Temporary removal of markers with missingness > 2%
Markers in: 232407
Markers removed: 3879
Markers remaining: 228528
Temporary removal of non-autosomal markers
Markers in: 228528
Markers removed: 8521
Markers remaining: 220007
Temporary removal of markers with HWE p < 1e-4
Markers in: 220007
Markers removed: 842
Markers remaining: 219165
Temporary removal of strand ambiguous markers
Markers in: 219165
Markers removed: 1205
Markers remaining: 217960
Temporary removal of markers in high LD
Markers in: 217960
Markers removed: 6338
Markers remaining: 211622
Prune set of markers using --indep-pairwise 200 100 0.1
Markers in: 211622
Markers removed: 171386
Markers remaining: 40236
Removal of pedigree inconsistent samples
Samples in: 17742
Samples for removal: 64
Samples removed: 64
Samples remaining: 17678
PCA after merge with HapMap
Markers after HapMap merge (used for PCA): 20308
Sample selection post PCA
Samples in: 17678
Samples removed after PCA: 785
Samples remaining after PCA: 16893
Split dataset into founders and offspring
Samples in: 16893
Founders: 11316
Offspring: 5577
IBD estimation
Samples in: 11316
Markers in: 647785
Remove samples with excess accumulated PIHAT:
Samples in: 11316
Samples for removal: 14
Samples removed: 14
Samples remaining: 11302
Remove one in a pair of samples with PIHAT > 0.1:
Samples in: 11302
Samples for removal: 478
Samples removed: 461
Samples remaining: 10841
IBD estimation
Samples in: 5577
Markers in: 647785
Remove samples with excess accumulated PIHAT:
Samples in: 5577
Samples for removal: 11
Samples removed: 11
Samples remaining: 5566
Remove one in a pair of samples with PIHAT > 0.1:
Samples in: 5566
Samples for removal: 90
Samples removed: 87
Samples remaining: 5479
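The report does not state which member of a related pair is dropped in the "remove one in a pair" step. The sketch below shows one common approach, purely as an assumption: first keep the pairs above the PI_HAT cutoff, then repeatedly drop the sample that takes part in the most remaining pairs. All names and the example IDs are hypothetical.

```python
def related_pairs(rows, cutoff=0.1):
    """rows: iterable of (sample_a, sample_b, pi_hat). Keep pairs above the cutoff."""
    return [(a, b) for a, b, pi_hat in rows if pi_hat > cutoff]


def prune_related(pairs):
    """Return a set of samples whose removal breaks every listed pair.

    Greedy rule (an assumption, not the pipeline's documented method):
    drop the sample involved in the most remaining pairs; ties are broken
    by sample ID so the result is deterministic.
    """
    remaining = {tuple(sorted(pair)) for pair in pairs}
    dropped = set()
    while remaining:
        counts = {}
        for a, b in remaining:
            counts[a] = counts.get(a, 0) + 1
            counts[b] = counts.get(b, 0) + 1
        worst = min(counts, key=lambda sample: (-counts[sample], sample))
        dropped.add(worst)
        remaining = {pair for pair in remaining if worst not in pair}
    return dropped


# Hypothetical IBD rows, for illustration only.
example_rows = [("s1", "s2", 0.25), ("s2", "s3", 0.12), ("s4", "s5", 0.5), ("s6", "s7", 0.02)]
example_dropped = prune_related(related_pairs(example_rows))
```

In this example a single removal (`s2`) resolves two pairs, so fewer samples are dropped than there are related pairs.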
Module 3 uses the ethnically homogeneous subset of samples to identify markers of good quality (for founders and offspring separately). Markers and samples are removed iteratively, with exclusion thresholds becoming progressively stricter, to ensure that only the highest-quality markers and samples remain in successive steps. Note that MT and Y markers are not cleaned.
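The alternating missingness schedule used in this module (markers 10%, samples 5%, markers 5%, samples 3%, markers 2%, samples 2%, as read from the step headings in the log) can be sketched as follows. The data layout (nested dictionaries with `None` for a missing call) and the function names are assumptions made for illustration; the pipeline itself operates on PLINK files.

```python
SCHEDULE = [
    ("markers", 0.10), ("samples", 0.05),
    ("markers", 0.05), ("samples", 0.03),
    ("markers", 0.02), ("samples", 0.02),
]


def missing_rate(calls):
    """Fraction of missing (None) calls; 0.0 for an empty input."""
    calls = list(calls)
    return sum(call is None for call in calls) / len(calls) if calls else 0.0


def iterative_missingness_filter(genotypes, schedule=SCHEDULE):
    """genotypes: {sample_id: {marker_id: call or None}}.

    Returns (kept_samples, kept_markers) after applying each threshold in
    turn, always computing rates on what survived the previous steps.
    """
    samples = set(genotypes)
    markers = set(next(iter(genotypes.values()))) if genotypes else set()
    for axis, threshold in schedule:
        if axis == "markers":
            markers -= {m for m in markers
                        if missing_rate(genotypes[s][m] for s in samples) > threshold}
        else:
            samples -= {s for s in samples
                        if missing_rate(genotypes[s][m] for m in markers) > threshold}
    return samples, markers
```

Starting with lenient thresholds lets a single failed sample be removed before it can push otherwise good markers over a strict marker-level cutoff, and vice versa.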
Number of markers and samples at start of cleaning:
Samples in start: 10841
Markers in start: 647785
Remove markers with missingness > 10%:
Markers in: 647785
Markers removed: 288
Markers remaining: 647497
Remove individuals with missingness > 5%:
Samples in: 10841
Samples removed: 1
Samples remaining: 10840
Remove markers with missingness > 5%:
Markers in: 647497
Markers removed: 1417
Markers remaining: 646080
Remove individuals with missingness > 3%:
Samples in: 10840
Samples removed: 19
Samples remaining: 10821
Remove markers with missingness > 2%:
Markers in: 646080
Markers removed: 8322
Markers remaining: 637758
Remove individuals with missingness > 2%:
Samples in: 10821
Samples removed: 18
Samples remaining: 10803
Remove autosomal markers with HWE p < 1e-7:
Markers in: 637758
Markers removed: 1659
Markers remaining: 636099
Remove samples with HET excess > 4SD using common autosomal markers (MAF > 0.01)
Samples in: 10803
Samples removed: 1
Samples remaining: 10802
Remove autosomal markers with HWE p < 1e-6
Markers in: 636099
Markers removed: 208
Markers remaining: 635891
Remove samples with HET excess > 4SD using rare autosomal markers (MAF < 0.01)
Samples in: 10802
Samples removed: 61
Samples remaining: 10741
Remove markers with missingness > 2%:
Markers in: 635891
Markers removed: 3
Markers remaining: 635888
Temporarily remove samples failing sex check (F: 0.2, 0.8):
Samples in: 10741
Samples for removal: 5
Samples removed: 5
Samples out: 10736
Markers into sex clean:
X markers in: 15099
Y markers in: 712
PAR markers in: 564
MT markers in: 126
Remove chrX and PAR markers with HWE p < 1e-6 (only female):
Markers (X + PAR) in: 15663
Markers removed: 29
Markers remaining: 15634
Remove chrX marker if any male has at least one heterozygote genotype:
Markers removed: 694
Markers remaining: 635165
Markers after sex clean:
Autosomes markers out: 619387
X markers out: 14404
Y markers out: 712
PAR markers out: 536
MT markers out: 126
TOTAL: 635165
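The rule "remove chrX marker if any male has at least one heterozygote genotype" can be sketched as below. Males carry one X chromosome, so a heterozygous call outside the PAR region indicates a genotyping problem. The data layout and names are assumptions for illustration only.

```python
def x_markers_with_male_hets(genotypes, sex):
    """Return chrX marker IDs with at least one heterozygous call in a male.

    genotypes: {marker_id: {sample_id: (allele1, allele2) or None}}
    sex:       {sample_id: "M" or "F"}
    """
    flagged = set()
    for marker, calls in genotypes.items():
        for sample, call in calls.items():
            if sex.get(sample) == "M" and call is not None and call[0] != call[1]:
                flagged.add(marker)
                break  # one heterozygous male is enough to flag the marker
    return flagged


# Hypothetical calls, for illustration only.
example_sex = {"m1": "M", "m2": "M", "f1": "F"}
example_calls = {
    "x_ok": {"m1": ("A", "A"), "m2": ("G", "G"), "f1": ("A", "G")},
    "x_bad": {"m1": ("A", "G"), "m2": None, "f1": ("A", "A")},
}
example_flagged = x_markers_with_male_hets(example_calls, example_sex)
```

Heterozygous calls in females and missing calls in males do not flag a marker.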
Number of markers and samples at start of cleaning:
Samples in start: 5479
Markers in start: 647785
Remove markers with missingness > 10%:
Markers in: 647785
Markers removed: 246
Markers remaining: 647539
Remove individuals with missingness > 5%:
Samples in: 5479
Samples removed: 2
Samples remaining: 5477
Remove markers with missingness > 5%:
Markers in: 647539
Markers removed: 1223
Markers remaining: 646316
Remove individuals with missingness > 3%:
Samples in: 5477
Samples removed: 19
Samples remaining: 5458
Remove markers with missingness > 2%:
Markers in: 646316
Markers removed: 7548
Markers remaining: 638768
Remove individuals with missingness > 2%:
Samples in: 5458
Samples removed: 9
Samples remaining: 5449
Remove autosomal markers with HWE p < 1e-7:
Markers in: 638768
Markers removed: 974
Markers remaining: 637794
Remove samples with HET excess > 4SD using common autosomal markers (MAF > 0.01)
Samples in: 5449
Samples removed: 4
Samples remaining: 5445
Remove autosomal markers with HWE p < 1e-6
Markers in: 637794
Markers removed: 123
Markers remaining: 637671
Remove samples with HET excess > 4SD using rare autosomal markers (MAF < 0.01)
Samples in: 5445
Samples removed: 34
Samples remaining: 5411
Remove markers with missingness > 2%:
Markers in: 637671
Markers removed: 5
Markers remaining: 637666
Temporarily remove samples failing sex check (F: 0.2, 0.8):
Samples in: 5411
Samples for removal: 2
Samples removed: 2
Samples out: 5409
Markers into sex clean:
X markers in: 15251
Y markers in: 712
PAR markers in: 564
MT markers in: 126
Remove chrX and PAR markers with HWE p < 1e-6 (only female):
Markers (X + PAR) in: 15815
Markers removed: 25
Markers remaining: 15790
Remove chrX marker if any male has at least one heterozygote genotype:
Markers removed: 477
Markers remaining: 637164
Markers after sex clean:
Autosomes markers out: 621013
X markers out: 14774
Y markers out: 712
PAR markers out: 539
MT markers out: 126
TOTAL: 637164
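The heterozygosity steps in this module remove samples with "HET excess > 4SD". How the statistic is computed is not stated in the report. The sketch below assumes a per-sample heterozygosity rate and a one-sided cutoff at the mean plus four standard deviations; the function name and input layout are hypothetical.

```python
from statistics import mean, stdev


def het_outliers(het_rates, n_sd=4.0):
    """Return sample IDs whose heterozygosity rate exceeds mean + n_sd * SD.

    het_rates: {sample_id: heterozygosity rate}, computed separately for
    the common and the rare marker set, as in the log.
    Assumption: one-sided test on the raw rate with the sample SD.
    """
    values = list(het_rates.values())
    if len(values) < 2:
        return set()
    cutoff = mean(values) + n_sd * stdev(values)
    return {sample for sample, rate in het_rates.items() if rate > cutoff}
```

Running the check separately on common and rare markers, as the log does, allows the two marker sets to flag different samples.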
Module 4 performs three steps: 1) it generates a list of samples for use in LMM analyses, which are able to account for relatedness; 2) it generates a list of unrelated samples by removing samples with excess relatedness; and 3) it calculates PCA covariates for the unrelated core for use in downstream analyses.
Markers and samples at beginning of module:
Markers start: 635165
Samples start: 11316
Remove markers not surviving QC in both parents and offspring:
Markers in: 635165
Markers removed: 427
Markers remaining: 634738
Remove samples with missingness rate > 2%:
Samples in: 11316
Samples removed: 42
Samples remaining: 11274
Remove samples with HET excess > 4SD using common autosomal markers (MAF > 0.01):
Samples in: 11274
Samples removed: 1
Samples remaining: 11273
Remove samples with HET excess > 4SD using rare autosomal markers (MAF < 0.01):
Samples in: 11273
Samples removed: 65
Samples remaining: 11208
Remove samples with excess accumulated PIHAT:
Samples in: 11208
Samples removed: 11
Samples remaining: 11197
Remove one in a pair of samples with PI_HAT > 0.1:
Samples in: 11197
Samples removed: 457
Samples remaining: 10740
PCA plot for unrelated core founders
Markers and samples at beginning of module:
Markers start: 637164
Samples start: 5577
Remove markers not surviving QC in both parents and offspring:
Markers in: 637164
Markers removed: 2426
Markers remaining: 634738
Remove samples with missingness rate > 2%:
Samples in: 5577
Samples removed: 33
Samples remaining: 5544
Remove samples with HET excess > 4SD using common autosomal markers (MAF > 0.01):
Samples in: 5544
Samples removed: 4
Samples remaining: 5540
Remove samples with HET excess > 4SD using rare autosomal markers (MAF < 0.01):
Samples in: 5540
Samples removed: 38
Samples remaining: 5502
Remove samples with excess accumulated PIHAT:
Samples in: 5502
Samples removed: 7
Samples remaining: 5495
Remove one in a pair of samples with PI_HAT > 0.1:
Samples in: 5495
Samples removed: 86
Samples remaining: 5409
PCA plot for unrelated core offspring
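The step "remove markers not surviving QC in both parents and offspring" amounts to a set intersection of the two marker lists produced by Module 3. The sketch below shows the operation on hypothetical marker IDs and checks the reported counts against each other; the function name is an assumption.

```python
def shared_markers(founder_markers, offspring_markers):
    """Markers that passed QC in both the founder and the offspring set."""
    return set(founder_markers) & set(offspring_markers)


# Counts copied from the log above.
FOUNDER_PASS = 635165
OFFSPRING_PASS = 637164
SHARED = 634738

removed_from_founders = FOUNDER_PASS - SHARED      # reported as 427
removed_from_offspring = OFFSPRING_PASS - SHARED   # reported as 2426

# Hypothetical marker IDs, for illustration only.
example_shared = shared_markers(["rs1", "rs2", "rs3"], ["rs2", "rs3", "rs4"])
```

Both branches end at the same count, 634738, which is expected because the intersection is the same set regardless of which list it is subtracted from.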
Module 5 prepares the cleaned data for pre-phasing and imputation. All samples from module 2 enter this module, but only the cleaned markers identified in module 3. To prepare for phasing and later imputation using the HRC reference, some preparation steps are needed.
1) markers above chr 23/X are removed
2) Mendelian errors are set to missing (using available trios and duos).
3) the dataset is checked using Will Rayner's preparation tool (http://www.well.ox.ac.uk/~wrayner/tools/)
4) dataset is split by chromosome.
Pre-phasing is performed using Shapeit2 v.837 (http://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html)
Imputation is performed by Sanger Imputation Server (https://imputation.sanger.ac.uk/)
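Step 2 above (setting Mendelian errors to missing) can be illustrated for a single biallelic autosomal marker. The sketch assumes genotypes coded as alternate-allele counts (0, 1, 2) with `None` for a missing call or an ungenotyped parent, which covers both trios and duos. The coding and function names are assumptions; chromosome X in males would need a separate rule that is not shown here.

```python
def is_mendelian_consistent(child, father=None, mother=None):
    """True if the child's genotype can arise from the available parents."""
    def transmittable(parent):
        # Alleles (as alt-allele counts) that a parent can pass on.
        if parent is None:
            return {0, 1}  # unknown parent: either allele is possible
        return {0: {0}, 1: {0, 1}, 2: {1}}[parent]

    if child is None:
        return True  # a missing call cannot be an error
    return any(f + m == child
               for f in transmittable(father)
               for m in transmittable(mother))


def zero_mendelian_errors(child, father=None, mother=None):
    """Return the family's calls with all set to None on an inconsistency."""
    if is_mendelian_consistent(child, father, mother):
        return child, father, mother
    return None, None, None
```

Whether only the child's call or the whole family's calls are set to missing is a design choice that the report does not specify; the second function blanks all three as one possible convention.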
Samples and markers into module:
Samples in: 17742
Markers in: 647785
Remove markers not passing QC for both offspring and founders:
Markers in: 647785
Markers shared: 634738
Markers removed: 13047
Markers remaining: 634738
Remove markers above chr 23:
Markers in: 634738
Markers removed: 1374
Markers remaining: 633364
Set Mendelian errors to missing:
Mendelian errors zeroed: 186590
HRC harmonizing:
Markers in: 633364
Marker chromosomes changed: 0
Marker positions changed: 0
Marker strand flips: 37109
Marker allele flips: 575573
Markers excluded (not in HRC): 65089
Markers after exclusion: 568275
| Chromosome | N |
|---|---|
| 1 | 45268 |
| 2 | 46161 |
| 3 | 38636 |
| 4 | 35311 |
| 5 | 33424 |
| 6 | 39844 |
| 7 | 31182 |
| 8 | 29286 |
| 9 | 24176 |
| 10 | 28150 |
| 11 | 28293 |
| 12 | 27051 |
| 13 | 19613 |
| 14 | 18290 |
| 15 | 17219 |
| 16 | 18758 |
| 17 | 17044 |
| 18 | 16219 |
| 19 | 13131 |
| 20 | 13854 |
| 21 | 7705 |
| 22 | 8191 |
| X | 11469 |
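As a consistency check, the per-chromosome counts in the table can be summed and compared with the marker total reported after HRC harmonizing. The snippet only re-uses numbers from this report; the variable names are arbitrary.

```python
# Per-chromosome marker counts, copied from the table above.
PER_CHROMOSOME = {
    "1": 45268, "2": 46161, "3": 38636, "4": 35311, "5": 33424, "6": 39844,
    "7": 31182, "8": 29286, "9": 24176, "10": 28150, "11": 28293, "12": 27051,
    "13": 19613, "14": 18290, "15": 17219, "16": 18758, "17": 17044, "18": 16219,
    "19": 13131, "20": 13854, "21": 7705, "22": 8191, "X": 11469,
}

MARKERS_INTO_HRC = 633364      # "Markers in" for HRC harmonizing
MARKERS_NOT_IN_HRC = 65089     # "Markers excluded (not in HRC)"

table_total = sum(PER_CHROMOSOME.values())
expected_total = MARKERS_INTO_HRC - MARKERS_NOT_IN_HRC
```

Both values come to 568275, matching "Markers after exclusion" in the log.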